10/19/2025

Scanpy Tutorial: Preprocessing, Dimensionality Reduction & Clustering, & Manual Cell Type Annotation¶

Scanpy is a single-cell analysis tool in Python.

Using a scanpy tutorial: preprocessing and clustering. Also refer to anndata basics.

In [1]:
# Core scverse libraries
import scanpy as sc
import anndata as ad

import hisepy

# only see warnings once
import warnings
warnings.filterwarnings(action='once')
/home/workspace/environment/pythonscrna12/lib/python3.13/site-packages/leidenalg/VertexPartition.py:413: SyntaxWarning: invalid escape sequence '\m'
  .. math:: Q = \\frac{1}{m} \\sum_{ij} \\left(A_{ij} - \\frac{k_i^\mathrm{out} k_j^\mathrm{in}}{m} \\right)\\delta(\\sigma_i, \\sigma_j),
/home/workspace/environment/pythonscrna12/lib/python3.13/site-packages/leidenalg/VertexPartition.py:788: SyntaxWarning: invalid escape sequence '\m'
  .. math:: Q = \\sum_{ij} \\left(A_{ij} - \\gamma \\frac{k_i^\mathrm{out} k_j^\mathrm{in}}{m} \\right)\\delta(\\sigma_i, \\sigma_j),
/home/workspace/environment/pythonscrna12/lib/python3.13/site-packages/leidenalg/Optimiser.py:27: SyntaxWarning: invalid escape sequence '\g'
  implementation therefore does not guarantee subpartition :math:`\gamma`-density.
/home/workspace/environment/pythonscrna12/lib/python3.13/site-packages/leidenalg/Optimiser.py:346: SyntaxWarning: invalid escape sequence '\s'
  .. math:: Q = \sum_k \\lambda_k Q_k.
In [2]:
# hisepy code for reading the scanpy demo files
samples = {
    "s1d1": "8c4c7e50-4888-4d98-901b-b9356056d1b8",
    "s1d3": "4d420fd6-c2a3-4d96-a6cf-1a2d717c95f4",
}
adatas = {}

for sample_id, uuid in samples.items():
    path = hisepy.cache_files([uuid])[0]
    sample_adata = sc.read_10x_h5(path)
    sample_adata.var_names_make_unique()
    adatas[sample_id] = sample_adata

# using anndata
adata = ad.concat(adatas, label="sample")
adata.obs_names_make_unique()
print(adata.obs["sample"].value_counts())
adata

#
/home/workspace/environment/pythonscrna12/lib/python3.13/site-packages/anndata/_core/anndata.py:1793: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
  utils.warn_names_duplicates("var")
/home/workspace/environment/pythonscrna12/lib/python3.13/site-packages/anndata/_core/anndata.py:1793: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
  utils.warn_names_duplicates("var")
sample
s1d1    8785
s1d3    8340
Name: count, dtype: int64
/home/workspace/environment/pythonscrna12/lib/python3.13/site-packages/anndata/_core/anndata.py:1793: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
  utils.warn_names_duplicates("var")
/home/workspace/environment/pythonscrna12/lib/python3.13/site-packages/anndata/_core/anndata.py:1793: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
  utils.warn_names_duplicates("var")
/home/workspace/environment/pythonscrna12/lib/python3.13/site-packages/anndata/_core/anndata.py:1791: UserWarning: Observation names are not unique. To make them unique, call `.obs_names_make_unique`.
  utils.warn_names_duplicates("obs")
Out[2]:
AnnData object with n_obs × n_vars = 17125 × 36601
    obs: 'sample'

The data contains ~8,000 cells per sample and 36,601 measured genes.

Quality Control¶

calculate_qc_metrics() calculates common quality control (QC) metrics. We can pass specific gene populations to calculate_qc_metrics() to calculate proportions of counts for these populations. Let's define mitochondrial, ribosomal and hemoglobin genes as three distinct populations.

In [3]:
# mitochondrial genes, "MT-" for human, "Mt-" for mouse
adata.var["mt"] = adata.var_names.str.startswith("MT-")
# ribosomal genes
adata.var["ribo"] = adata.var_names.str.startswith(("RPS", "RPL"))
# hemoglobin genes
adata.var["hb"] = adata.var_names.str.contains("^HB[^(P)]")

Calculate QC metrics.

In [4]:
sc.pp.calculate_qc_metrics(
    adata, qc_vars=["mt", "ribo", "hb"], inplace=True, log1p=True
) # scanpy.pp is a module in the Scanpy library specifically for preprocessing scRNA-seq!
# In this tutorial, note that sc.__.function indicates modules in the Scanpy library.

Inspect violin plots of some of the computed QC metrics:

  • the number of genes expressed in the count matrix
  • the total counts per cell
  • the percentage of counts in mitochondrial genes
In [5]:
sc.pl.violin(
    adata,
    ["n_genes_by_counts", "total_counts", "pct_counts_mt"],
    jitter=0.4,
    multi_panel=True,
)
No description has been provided for this image

Combine those three QC metrics of the violin plots jointly by inspecting a scatter plot colored by pct_counts_mt:

In [6]:
sc.pl.scatter(adata, "total_counts", "n_genes_by_counts", color="pct_counts_mt")
No description has been provided for this image

Now, remove cells that have too many mitochondrial genes expressed or too many total counts-- start with a very permissive filter: filter cells with less than 100 genes expressed and genes that are detected in less than 3 cells.

  • Why? "For single-cell RNA sequencing (scRNA-seq), cells with high mitochondrial content are traditionally excluded due to the assumption that they are dying or of poor quality

Note: for datasets with multiple batches, quality control should be performed for each sample individually as quality control thresholds can vary substantially between batches.

In [7]:
sc.pp.filter_cells(adata, min_genes=100)
sc.pp.filter_genes(adata, min_cells=3)

Doublet detection¶

  • Scrublet predicts cell doublets using a nearest-neighbor classifier of observed transcriptomes and simulated doublets.
  • We can remove doublets by either filtering out the cells called as doublets, or waiting until we’ve done a clustering pass and filtering out any clusters with high doublet scores.
  • Tutorial has more info on different doublet detection/filtering methods.
In [8]:
sc.pp.scrublet(adata, batch_key="sample")

Normalization¶

  • Another preprocessing step (so far, did quality control and doublet removal)
  • A common approach is count depth scaling with subsequent log plus one (log1p) transformation.
  • In this example, we are applying median count depth normalization with log1p transformation (AKA log1PF)
In [9]:
# Saving count data
adata.layers["counts"] = adata.X.copy()
In [10]:
# Normalizing to median total counts
sc.pp.normalize_total(adata)
# Logarithmize the data
sc.pp.log1p(adata)

Feature selection¶

  • To reduce the dimensionality of the dataset and only include the most informative genes, perform feature selection (gene selection).
  • scanpy function pp.highly_variable_genesannotates highly variable genes by reproducing the implementations of Seurat, Cell Ranger, or another Seurat version, based on the flavor.
  • Nothing specified here
In [11]:
sc.pp.highly_variable_genes(adata, n_top_genes=2000, batch_key="sample")
In [12]:
sc.pl.highly_variable_genes(adata)
No description has been provided for this image

Dimensionality Reduction¶

  • PCA reveals the main axes of variation and denoises the data
  • PCA transforms a large set of variables into a smaller set of new, uncorrelated variables called "principal components"
  • By keeping only the first few principal components (those that capture the majority of the variance), reduce the number of dimensions
In [13]:
sc.tl.pca(adata) # scanpy.tl for tools

Inspect the contribution of single PCs to the total variance in the data. This gives us information about how many PCs we should consider in order to compute the neighborhood relations of cells, e.g. used in the clustering function leiden() or tsne().

Better to overestimate the number of principal components? "No significant downside"

In [14]:
sc.pl.pca_variance_ratio(adata, n_pcs=50, log=True) # scanpy.pl for plotting
No description has been provided for this image

Should also always plot the principal components to see if there are any potentially undesired features (e.g. batch effects (like if the data points separate based on the experimental batch and not the biological variation), or QC metrics) driving signifigant variation in this dataset!

In this case, there isn’t anything too alarming.

In [15]:
sc.pl.pca(
    adata,
    color=["sample", "sample", "pct_counts_mt", "pct_counts_mt"],
    dimensions=[(0, 1), (2, 3), (0, 1), (2, 3)],
    ncols=2,
    size=2,
)
No description has been provided for this image

Nearest neighbor graph construction and visualization with UMAPs¶

Compute the neighborhood graph of cells using the PCA representation of the data matrix.

What is a UMAP? from the Allen Institute and PCA vs. UMAP vs. t-SNE Reddit thread:

  • dimensionality reduction tool, like PCA; but PCA is linear, UMAP and tSNE are both non-linear and non-deterministic methods based on ordering the points into neighbor graphs.
  • Follow along with "How are UMAPs used to represent transcriptomic data?" visualization on the Allen Institute website for a broad sense of our project (maybe bring this up with mentors to see if this is kind of similar to our project)...
  • heatmap made from UMAP that compares the level of gene expression between each cell type (on X axis) for specific genes of interest (Y axis)
  • We can filter, color-code, or alter UMAPs to analyze them. For example, the Allen Institute website shows how you can apply colored filters based on a particular characteristic of the cell, such as gene expression...
In [16]:
sc.pp.neighbors(adata) # using default parameter n_neighbors = 15 in sc.pp.neigbors()

This graph can then be embedded in two dimensions for visualization with UMAP!

In [17]:
sc.tl.umap(adata) # using default parameter min_dist = 0.5 in sc.tl.umap())

We can now visualize the UMAP according to the sample.

In [18]:
sc.pl.umap(
    adata,
    color="sample",
    # Setting a smaller point size to get prevent overlap
    size=2,
)
No description has been provided for this image

Even though the data considered in this tutorial includes two different samples, we only observe a minor batch effect. Let's continue with clustering and annotation of our data.

What if we see batch effects in our UMAP?

  • Integrate across samples and perform batch correction/integration.
  • scanorama and scvi-tools for batch integration.

Clustering (still working with UMAPs)¶

  • Recommend the Leiden graph-clustering method (community detection based on optimizing modularity)
In [19]:
# Using the igraph implementation and a fixed number of iterations can be significantly faster, especially for larger datasets
sc.tl.leiden(adata, flavor="igraph", n_iterations=2)
In [20]:
sc.pl.umap(adata, color=["leiden"])
No description has been provided for this image

Re-assess quality control and cell filtering¶

  • Re-assess our filtering strategy by visualizing different QC metrics using UMAP
In [21]:
sc.pl.umap(
    adata,
    color=["leiden", "predicted_doublet", "doublet_score"],
    # increase horizontal space between panels
    wspace=0.5,
    size=3,
)
No description has been provided for this image
In [22]:
sc.pl.umap(
    adata,
    color=["leiden", "log1p_total_counts", "pct_counts_mt", "log1p_n_genes_by_counts"],
    wspace=0.5,
    ncols=2,
)
No description has been provided for this image

Manual cell-type annotation¶

This tutorial simplifies cell-type annotation.

We have now reached a point where we have obtained a set of cells with decent quality, and we can proceed to their annotation to known cell types. Typically, this is done using genes that are exclusively expressed by a given cell type, or in other words these genes are the marker genes of the cell types, and are thus used to distinguish the heterogeneous groups of cells in our data.

Commonly and classically, cell type annotation uses those marker genes subsequent to the grouping of the cells into clusters. So, let’s generate a set of clustering solutions which we can then use to annotate our cell types. Here, we will use the Leiden clustering algorithm which will extract cell communities from our nearest neighbours graph.

  • Single Cell Best Practices annotation chapter: a single cell might not have a count for a specific marker even if it was expressed in that cell, simply due to the inherent sparsity of single cell data. Clustering enables the detection of cells highly similar in overall gene expression and can therefore account for drop-outs at single cell level.
In [23]:
for res in [0.02, 0.5, 2.0]:
    sc.tl.leiden(adata, key_added=f"leiden_res_{res:4.2f}", resolution=res, flavor="igraph")

Notably, the number of clusters that we define is largely arbitrary, and so is the resolution parameter that we use to control for it. As such, the number of clusters is ultimately bound to the stable and biologically-meaningful groups that we can ultimately distringuish.

In [24]:
sc.pl.umap(
    adata,
    color=["leiden_res_0.02", "leiden_res_0.50", "leiden_res_2.00"],
    legend_loc="on data",
)
No description has been provided for this image

Though UMAPs should not be over-interpreted, here we can already see that in the highest resolution our data is over-clustered, while the lowest resolution is likely grouping cells which belong to distinct cell identities.

Marker gene set¶

Let’s define a set of marker genes for the main cell types that we expect to see in this dataset. These were adapted from Single Cell Best Practices annotation chapter, for a more detailed overview and best practices in cell type annotation, we refer the user to it.

In [26]:
marker_genes = {
    "CD14+ Mono": ["FCN1", "CD14"],
    "CD16+ Mono": ["TCF7L2", "FCGR3A", "LYN"],
    # Note: DMXL2 should be negative
    "cDC2": ["CST3", "COTL1", "LYZ", "DMXL2", "CLEC10A", "FCER1A"],
    "Erythroblast": ["MKI67", "HBA1", "HBB"],
    # Note HBM and GYPA are negative markers
    "Proerythroblast": ["CDK6", "SYNGR1", "HBM", "GYPA"],
    "NK": ["GNLY", "NKG7", "CD247", "FCER1G", "TYROBP", "KLRG1", "FCGR3A"],
    "ILC": ["ID2", "PLCG2", "GNLY", "SYNE1"],
    "Naive CD20+ B": ["MS4A1", "IL4R", "IGHD", "FCRL1", "IGHM"],
    # Note IGHD and IGHM are negative markers
    "B cells": [
        "MS4A1",
        "ITGB1",
        "COL4A4",
        "PRDM1",
        "IRF4",
        "PAX5",
        "BCL11A",
        "BLK",
        "IGHD",
        "IGHM",
    ],
    "Plasma cells": ["MZB1", "HSP90B1", "FNDC3B", "PRDM1", "IGKC", "JCHAIN"],
    # Note PAX5 is a negative marker
    "Plasmablast": ["XBP1", "PRDM1", "PAX5"],
    "CD4+ T": ["CD4", "IL7R", "TRBC2"],
    "CD8+ T": ["CD8A", "CD8B", "GZMK", "GZMA", "CCL5", "GZMB", "GZMH", "GZMA"],
    "T naive": ["LEF1", "CCR7", "TCF7"],
    "pDC": ["GZMB", "IL3RA", "COBLL1", "TCF4"],
}
In [27]:
sc.pl.dotplot(adata, marker_genes, groupby="leiden_res_0.02", standard_scale="var")
No description has been provided for this image

There are fairly clear patterns of expression for our markers show here, which we can use to label our coarsest clustering with broad lineages.

In [28]:
adata.obs["cell_type_lvl1"] = adata.obs["leiden_res_0.02"].map(
    {
        "0": "Lymphocytes",
        "1": "Monocytes",
        "2": "Erythroid",
        "3": "B Cells",
    }
)
In [30]:
sc.pl.dotplot(adata, marker_genes, groupby="leiden_res_0.50", standard_scale="var")
No description has been provided for this image

This seems like a resolution that is suitable to distinguish most of the different cell types in our data. As such, let’s try to annotate those by manually using the dotplot above, together with the UMAP of our clusters. Ideally, one would also look specifically into each cluster, and attempt to subcluster those if required.

Differentially-expressed Genes as Markers: Wilcoxon and t-test¶

Furthermore, one can also calculate marker genes per cluster and then look up whether we can link those marker genes to any known biology, such as cell types and/or states.

This is typically done using simple statistical tests, such as Wilcoxon and t-test, for each cluster vs the rest.

In [31]:
# Obtain cluster-specific differentially expressed genes
sc.tl.rank_genes_groups(adata, groupby="leiden_res_0.50", method="wilcoxon")

We can then visualize the top 5 differentially-expressed genes on a dotplot.

In [32]:
sc.pl.rank_genes_groups_dotplot(adata, groupby="leiden_res_0.50", standard_scale="var", n_genes=5)
WARNING: dendrogram data not found (using key=dendrogram_leiden_res_0.50). Running `sc.tl.dendrogram` with default parameters. For fine tuning it is recommended to run `sc.tl.dendrogram` independently.
No description has been provided for this image

We can then use these genes to figure out what cell types we’re looking at. For example, Cluster 7 is expressing NKG7 and GNLY, suggesting these are NK cells.

To create your own plots, or use a more automated approach, the differentially expressed genes can be extracted in a convenient format with scanpy.get.rank_genes_groups_df()

In [33]:
sc.get.rank_genes_groups_df(adata, group="7").head(5)
Out[33]:
names scores logfoldchanges pvals pvals_adj
0 SAT1 27.752102 3.758692 1.643582e-169 3.850420e-165
1 MT-CO1 26.531887 2.455697 4.156276e-155 4.868454e-151
2 NEAT1 26.461023 3.385525 2.724711e-154 2.127727e-150
3 MT-ND2 25.010853 2.511811 4.658379e-138 2.728296e-134
4 MT-ND1 24.578140 2.427374 2.164193e-133 1.014011e-129
In [34]:
dc_cluster_genes = sc.get.rank_genes_groups_df(adata, group="7").head(5)["names"]
sc.pl.umap(
    adata,
    color=[*dc_cluster_genes, "leiden_res_0.50"],
    legend_loc="on data",
    frameon=False,
    ncols=3,
)
No description has been provided for this image

You may have noticed that the p-values found here are extremely low. This is due to the statistical test being performed considering each cell as an independent sample. For a more conservative approach you may want to consider “pseudo-bulking” your data by sample (e.g. sc.get.aggregate(adata, by=["sample", "cell_type"], func="sum", layer="counts")) and using a more powerful differential expression tool, like pydeseq2.

In [ ]: